Abstract


A pattern recognizer is usually a modular system consisting of a feature extraction
module and a recognition module. Traditionally, these two modules have been
designed separately, which may not result in optimal recognition accuracy. A speech
recognizer is one such system, in which the feature extraction module is designed
using expert knowledge. In our approach, we apply a Genetic Algorithm to evolve a
feature extractor based on a fitness evaluation over phonemically tagged data. A
feature extractor can be evaluated based on its ability to map instances of different
classes into different regions of the feature space. A measure of goodness can be
associated with a feature extractor using class dissimilarity measures, taking phonemes
as classes. Different feature extractors can be produced by varying the boundaries of
the filters in the filter-bank of a Mel-frequency cepstrum coefficient (MFCC)-like
feature extractor. We use a class separability measure as the fitness function and the
filter-bank as the evolving element to run a Genetic Algorithm based optimization.
On average, the evolved feature extractor achieves a significant increase in accuracy
over MFCC.
Chapter 1
Introduction 


The goal of a speech recognition system is to accurately and efficiently convert a
speech signal into a text message independent of the device, speaker or the
environment.

Naturally, one of the most promising approaches to speech recognition is to
investigate the human process of speech perception and to apply the scientific findings
to the development of a recognition system. This human-science-based approach has
not yet achieved a comprehensive analysis of the human mechanism, though the partial
findings have been used in current systems. The modular design of the system
is also encouraged by the human-science-based findings. First, a feature extraction
module generates features from a given speech signal; a recognizer module then uses
these features to classify the input signal and output the recognized hypothesis.

The design of a feature extraction module is highly inspired by the human perception
process, while the recognizer is designed using different machine learning models.
For feature extraction, the Mel-frequency cepstrum coefficient (MFCC) method is the
most widespread in such systems. Continuous research in this field for approximately
four decades has brought the recognition accuracy close to 100%. This progress can
be largely attributed to better speech modeling techniques. Today’s state of the art
systems contain HMM based models, which have been found to be the best model
for speech[18]. This is because an HMM can implicitly handle variations in length.


Today, there are many speech recognition systems available in both the research and
commercial domains. SPHINX, HTK and ISIP are examples of open source systems,
while Dragon Systems by ScanSoft and ViaVoice from IBM are examples of
commercial systems.

1.1 Motivation 

In spite of such progress in recognition accuracy, automatic speech recognition
(ASR) systems still have a long way to go before they can be used for routine input.
The main difficulties arise due to:

• Change in the environment of operation 

• High degree of noise 

• Change in the channel characteristics 

• Speaker variations 

• Change in the microphone characteristics 

Another problem with current systems is their high memory and computational
requirements. It is also important to note that these problems are correlated; for
example, decreasing the memory and computation requirements degrades the robustness
of the system. Such trade-offs compel us to fine tune the system for a particular
application. Researchers are now focusing on improving the memory and computation
aspects while maintaining system performance. There are many approaches to a
solution. One obvious direction would be to analyze the Human Speech Recognition
(HSR) system. Studies show that the perception process utilizes not only acoustic and
phonetic knowledge but also higher levels of knowledge such as semantics, context
and pragmatics. Unless we incorporate this information, we will not be able to
attain HSR performance. If we think carefully about the content of the signal, it
contains the following information.

• Speaker Information (SI)

• Signal Message (SM)

• Environmental Characteristics (EC)

One of the reasons for the success of HSR lies in its ability to separate different
information for different purposes. For example, on listening to a speech segment, we
would be able to say that it was the message “come here” from a person X and that a
TV was playing in the background. If we are able to find such a mechanism to extract
the piece of information relevant to the intent, it would be a big step towards making
speech recognition a real working system. The speech modeling domain has seen a
tremendous amount of work giving high accuracies, and it seems to have reached a
performance ceiling. Only better features for recognition can improve the situation
any further. Hence our work is focused on studying this problem.

We need to find a feature set with lower dimensionality and the richest discriminant
information. Decreasing the dimensionality while keeping the discriminant information
would yield reduced memory and computation requirements. This is because
the mean and variance vectors take less memory, and the computation of the
probability of a frame with respect to a state is also reduced.

1.2 Problem Definition 

In this thesis, we have focused on the process of extraction of optimal features from 
the speech signal that would help the onward recognition process decide the correct 
hypothesis. The goodness of a feature extraction system is determined based on 
the feature space generated. A Genetic algorithm has been used as an optimization 
mechanism to obtain the best features for the task. 
1.3 Organization of thesis


The organization of the thesis is as follows. Chapter 2 provides an introduction to
speech recognition. It first provides the mathematical formulation of the problem
and describes the problem as a Pattern Recognition task, splitting the whole process
into two main parts: Feature Extraction and Classification. Focus is given to
various feature extraction methods, where we describe the filter-bank based feature
extraction in detail.

Chapter 3 contains some of the concepts used in our work and previous work done 
in this domain. These are Genetic Algorithm and Discriminative Feature Extrac- 
tion. A general formulation of a genetic algorithm and its major components have 
been described. It also reviews the work done in the domain of feature extraction 
learning for speech recognition. It mainly consists of the kinds of feature extractors 
considered and methods to obtain the goodness estimation of the same. 

Our formulation of the filter-bank optimization problem using a genetic algorithm
is described in Chapter 4. Chapter 5 describes the experiments done and the results.
Finally, Chapter 6 concludes, providing some directions for further work.
Chapter 2

Fundamentals of Speech Recognition 


The basic problem of speech recognition is to find the best word sequence
corresponding to the speech signal. The word sequence comes from a set of possible
word sequences, called the language L. The words themselves come from a closed set
called the dictionary D. The language can contain an infinite number of sentences based
on the grammar provided. The task of the recognizer is to find a valid word sequence
present in the given language L. Speech is a continuous time analog signal. This
kind of signal cannot be directly processed by digital systems. Hence, sampling
and quantization are performed to transform the continuous input speech signal into
a discrete signal with quantized amplitude. A pre-processing system (also known as
feature extractor or front end processing) transforms this into a sequence of feature
vectors.

Broadly, we can divide the speech recognition area into two branches[5]: 

• Isolated Word Recognition (IWR)

• Continuous Speech Recognition (CSR)

In IWR, the recognizer takes as input the observation sequence of one utterance
at a time (spoken in isolation and belonging to a fixed dictionary) and outputs the
word which has the highest probability given that observation sequence. In
these systems, the speech containing the words can be easily isolated and separately
recognized. But speech communication used in the real world has a different pattern.
Here the word boundaries are not well separated as in the case of isolated words. Also,
the segmentation of the speech signal into different words is difficult and sometimes
considered impossible. The framework of IWR cannot handle these complexities
and we need to employ statistical techniques. These systems fall into the realm of
CSR systems. It is important to note that even though IWR has very limited
recognition power, it has many applications in the real world.

2.1 Major Approaches 

The problem has been attacked using two major approaches: Template matching 
based systems and Stochastic Modeling based systems. Of these, the stochastic 
modeling approach is currently the dominant technique as it can be scaled to large 
vocabularies. 

Template Based Approach This is an approach used for IWR systems. As its
name suggests, the system has a prototype representation $O_i$ of every word $w_i \in D$.
An unknown observation sequence X is compared to each prototype and the
one with the least distance is given as the recognized word. Here the choice of the
distance function is critical to system performance. The fact that the lengths of
the observation sequences differ makes the distance measure more complex.
A dynamic programming based algorithm, Dynamic Time Warping (DTW)[17], is
usually used to obtain the distance.
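
The template matching procedure can be sketched as follows. This is a minimal illustration rather than the implementation of any particular system: the function names, the Euclidean local distance, the unconstrained step pattern and the toy one-dimensional templates are all assumptions made for the example.

```python
import math

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # assumed Euclidean local distance
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def recognize(x, templates):
    """Return the word whose prototype is nearest to x under DTW."""
    return min(templates, key=lambda word: dtw_distance(x, templates[word]))

# Hypothetical prototypes (one-dimensional feature vectors for brevity).
templates = {
    "yes": [(0.0,), (1.0,), (2.0,), (1.0,)],
    "no": [(3.0,), (3.0,), (0.0,)],
}
x = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (1.0,)]  # a time-stretched "yes"
word = recognize(x, templates)
```

Because the warping path may repeat frames, the stretched input aligns with the "yes" prototype at zero cost even though the two sequences have different lengths.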

Stochastic Approach Let $W \in L$ be a sentence such that $W = w_1, w_2, w_3, \ldots, w_n$
where $w_i \in D$. The task of the speech recognizer is the following. Given an
observation sequence X corresponding to a sequence of words W, output the sentence
$\hat{W}$ that, according to some criterion, best accounts for the observed sequence X. Let
$\Pr(W/X)$ be the probability that the sentence W was spoken, given that the
observation sequence X has been observed. Let $\Pr(W)$ be the a priori probability that
the sentence W is spoken. Then the recognizer should pick the sentence $\hat{W}$ such that,

$$\Pr(\hat{W}/X) = \max_{W} \Pr(W/X) \qquad (1)$$

Using Bayes’ rule, the right hand side of Equation 1 can be rewritten as,

$$\max_{W} \Pr(W/X) = \max_{W} \frac{\Pr(X/W)\Pr(W)}{\Pr(X)} \qquad (2)$$

Since the maximization in Equation 1 is over a fixed X, we have from Equations
1 and 2 a reduction of the problem to determining a sentence $\hat{W}$ such that,

$$\hat{W} = \arg\max_{W} \Pr(X/W)\Pr(W) \qquad (3)$$

Equation 3 is the basis of a stochastic or statistical speech recognition system.
This equation leads to the modularization of the system as shown in Figure 2.1. One
of the modules is the feature extractor. It is also known as the front-end subsystem
of the ASR. The observation sequence X is the output of this system. We need to
find tools which would provide $\Pr(W)$ and $\Pr(X/W)$. We can see that $\Pr(W)$ is
dependent on the characteristics of the language L. Similarly, $\Pr(X/W)$ connects
the word sequence W and the observation sequence X. These methods fall in the
realm of Language Modeling and Acoustic Modeling, respectively. The following
sections describe each subsystem in detail.
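
The decision rule of Equation 3 can be illustrated with a small sketch. The candidate sentences and their probabilities below are invented for illustration; in a real recognizer the acoustic term would come from the acoustic model and the prior from the language model. Log-probabilities are used, as is common practice, so that the product becomes a sum.

```python
import math

def decode(candidates, acoustic_logprob, language_logprob):
    """Pick the W maximizing Pr(X/W) * Pr(W), computed in the log domain."""
    return max(candidates,
               key=lambda w: acoustic_logprob[w] + language_logprob[w])

# Hypothetical scores for two competing sentences.
candidates = ["call home", "cold home"]
acoustic_logprob = {"call home": math.log(0.20), "cold home": math.log(0.25)}
language_logprob = {"call home": math.log(0.30), "cold home": math.log(0.05)}

best = decode(candidates, acoustic_logprob, language_logprob)
```

In this toy case the acoustically better candidate ("cold home") loses because the language model assigns it a much lower prior, which is the interplay Equation 3 expresses.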

Acoustic Phonetics

In a spoken language, a phoneme is a basic unit of sound that can distinguish
words (that is, changing a phoneme in a word produces another word having a
different meaning). In that way, it is defined as the smallest contrastive unit in the
sound system of a spoken language. Phonemes are not physical sounds, but
abstractions. A phoneme is regarded as a set of phones that are treated as a single
sound and represented by a common symbol. An allophone is one of the several
phones that belong to the same phoneme. A phoneme has different acoustic
realizations (allophones) based on the phonetic context. The common notation used in
linguistics employs slashes (//) around a symbol representing a phoneme and square
brackets ([]) around a symbol representing an allophone. For example, in English the
words cat and rat each have three phonemes, /kæt/ and /ræt/. A pair of words
that are identical except for a single phoneme is known as a minimal pair. [pʰ] as
in pin and [p] as in spin are two allophones of /p/.

Figure 2.1: General Speech Recognition System


The acoustic representation of a phoneme can vary significantly with the speaker
and the context of the phoneme (place within the utterance, neighboring phonemes);
that is, the phoneme-acoustic mapping is not one-to-one. The same phoneme displays
various acoustic representations based on the speaker, the context and the background
noise. The articulation of one acoustic unit has an effect on the next unit to be
articulated. This problem is known as coarticulation. To counter the effect of
coarticulation, one needs to include context information while designing the base
units. How much context is put in the base units is a design decision.
Following are some of the base units.

• Biphones 

In Biphones, one previous or next context is used to define base units, giving
rise to Left-context Biphones and Right-context Biphones respectively. If the
phoneme set P is of size N, then the Biphones would yield a base set of size
$N^2$.

Figure 2.2: Components Contributing to Speech Signal


• Diphones 

A diphone is defined as the last half of one phoneme i and first half of phoneme 
j where phoneme i and j occur consecutively. This kind of unit is used to model 
the transient behavior at the junction point of two phonemes. 

• Triphones 

A triphone is defined by its left and right context. That is, ijk where $i, j, k \in P$
represents a triphone which has a central phoneme j preceded by i and followed
by k. This is the most widely used base unit set in speech recognition as it
nicely captures the coarticulation effects. One disadvantage of using
triphones is the size of the set, which is $N^3$.

There is always a trade off between the amount of context taken and recognition 
accuracy. Increasing the context would increase the accuracy but it also increases 
the base set size. It is normally seen that the triphone is an optimal solution. 
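
As a small illustration of the base units above, the sketch below expands a phoneme sequence into triphones and compares the sizes of the base sets. The "left-center+right" label format, the "sil" boundary symbol and the phoneme set size of 40 are assumptions chosen for the example, not values taken from this work.

```python
def to_triphones(phones, boundary="sil"):
    """Label each phoneme with its left and right neighbor as 'l-c+r'."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

units = to_triphones(["k", "ae", "t"])  # the word "cat"

# Size of the base set as context grows, for a hypothetical N = 40 phonemes.
phoneme_set_size = 40
biphone_units = phoneme_set_size ** 2
triphone_units = phoneme_set_size ** 3
```

With 40 phonemes the biphone set already has 1600 units and the triphone set 64000, which shows how quickly the base set grows with added context.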

2.2 Feature Extraction 

Figure 2.3: Role of Feature Extraction in Speech Recognition and Speaker Recognition

Feature extraction is perhaps the fundamental problem in pattern recognition
because it is the first step in building a recognizer. The purpose of a feature extractor
is to identify, within the data, what information is needed to perform accurate
classification. The feature extraction process is expected to discard information
irrelevant to the task while keeping the useful information. The following properties
are required for a good feature extractor:

• Compact features to enable real time analysis 

• Minimize the loss of discriminant information 

A feature extractor (FE) for ASR has a specific task to perform. As shown in Figure
2.2, the speech signal contains the characteristic information of the speaker (SI) and
environment (EC) in addition to the signal message (SM). An FE for speech recognition
needs to maximally discard the SI and EC information and only allow the SM
information to pass. On the other hand, an FE for the speaker recognition task needs
to extract the SI information from the speech signal. Figure 2.3 shows an ideal FE
for the speech and speaker recognition tasks. The ability of an FE for speech recognition
improves depending upon how well SI and EC are filtered out:

• SI: Speaker Independence

• EC: Noise Robustness

Mathematically, it is a transformation from the input space ($\Phi_s$) to the feature space ($\Phi_x$).

$$F_\theta : \Phi_s \rightarrow \Phi_x \qquad (4)$$

Here Equation 4 shows that $F_\theta$ is a transformation function parameterized by
$\theta$. Hence, variation in $\theta$ would yield different feature spaces. Let the set of such
spaces be $\chi$. The following are some interesting properties of this set that can be
investigated.

• Is there a feature space which separates the target class completely? 

• Which is the most discriminating feature space for a discrimination measure? 

In this thesis, we provide a mechanism to answer such questions. After this
abstract view of a feature extractor module, let us now look at specific feature extractors
in use in current recognition systems. The speech signal is processed in frames with
a frame size ranging from 15 to 25 milliseconds and an overlap of 50%-70% between
consecutive frames, as shown in figure 2.2. Hence, the speech is processed on a
frame-by-frame basis. The overlap between two consecutive frames is necessary in order
to account for the possibility of an acoustic unit being split across frames. There are
two major methods to extract features from each frame.

1. Linear Predictive Coefficients (LPC) - Temporal Features

2. Filter-bank based cepstrum features - Spectrum Features

In the LPC method, we try to approximate the speech segment by a linear equation.
The amplitude of the signal at time n+1 is predicted from the previous (k+1) samples
already seen. The coefficients a are computed by minimizing the prediction error
over the entire speech segment.

$$\hat{x}(n+1) = a_k x(n) + a_{k-1} x(n-1) + a_{k-2} x(n-2) + \ldots + a_0 x(n-k) \qquad (5)$$
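
One common way to compute such coefficients is the autocorrelation method solved with the Levinson-Durbin recursion, sketched below. This is an assumed, illustrative choice: the text only states that the prediction error is minimized. In the code, `coeffs[0]` multiplies the most recent sample, so it plays the role of $a_k$ in Equation 5.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x[n] * x[n + lag], for lag = 0..max_lag."""
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def lpc(x, order):
    """Levinson-Durbin recursion.

    Returns (coeffs, error) where the prediction is
    x_hat(n) = coeffs[0]*x(n-1) + coeffs[1]*x(n-2) + ...
    Assumes x is not all zeros (otherwise r[0] == 0).
    """
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)   # a[j] multiplies x(n-j); a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error

# A decaying signal that obeys x(n) = 0.9 * x(n-1) exactly.
signal = [0.9 ** n for n in range(200)]
coeffs, residual = lpc(signal, order=2)
```

For this signal the first coefficient comes out close to 0.9 and the second close to 0, since a single past sample already predicts the next one.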

The other, more prevalent method is filter-bank based cepstrum features.
This method of feature extraction is described in detail, as the optimization
experiments are done on this method. Figure 2.5 shows a block diagram of a
filter-bank based feature extractor. The processing in each block is explained below.
Here the speech signal is s(n).

Figure 2.5: Filter-bank based feature extraction

1. Blocking and Windowing 

Here the speech signal s(n) is segmented into overlapping blocks and a window 
function is applied to obtain a windowed signal. This windowing operation is 
performed to reduce the discontinuities at the boundaries of the signal segment. 
Let the length of the segment be N. 

2. Fourier Transform 

As the filtering operation is applied in the frequency domain, the signal is
transformed to the frequency domain using the Discrete Fourier Transformation (DFT).
DFT transforms the signal from the discrete time domain to the discrete frequency
domain. By discrete frequency domain, we mean that the frequency domain is
sampled into a finite number of points (F). DFT is computed efficiently in
practice using the Fast Fourier Transform (FFT) algorithm. The commonly used radix-2
form of this algorithm requires the input signal length to be a power of 2. Hence the
signal is zero-padded to make its length a power of 2. Further, the squared magnitude
of the spectrum vector is taken to obtain the power spectrum vector
$x = \{x_1, \ldots, x_f, \ldots, x_F\}$.

Any time-frequency transformation can be used in this block, but in almost 
all the systems only Fourier Transform is used. The other transformation that 
is now being applied is Wavelet Transformation. 

3. Filter-Bank Based Energy Extraction 

In this block, the energy of the signal in different frequency bands is obtained
for further processing. This task is performed by a filter corresponding to each
frequency band. A filter can be defined as a mechanism to pass or suppress the
energy contained in certain bands. We will use the implementation of a filter
in the frequency domain only. The filter can have different shapes (triangular,
rectangular, Gaussian etc.) depending upon the requirement. A filter $\mathcal{F}$ with
any shape can be represented using an F dimensional weight vector.

$$\mathcal{F} = \{w_i \mid i = 1, \ldots, F\} \qquad (6)$$

The energy of such a band can be obtained by,

$$y_j = \sum_{i=1}^{F} x_i w_i \qquad (7)$$

where the $w_i$ are the weights of the $j$-th filter. A filter-bank is a set of filters
(say N). Hence the output of this block y is an energy vector of length N.

$$y = \{y_1, \ldots, y_j, \ldots, y_N\} \qquad (8)$$

where $y_j$ corresponds to the $j$-th filter.

4. Cepstrum Generation 

Here a Discrete Cosine Transformation (DCT) is applied to the log-compressed
energy vector, producing cepstral coefficients as shown below.

$$c_k = \sum_{n=1}^{N} \log(y_n) \cos\left(\frac{\pi(2n-1)(k-1)}{2N}\right), \quad k = 1, 2, \ldots \qquad (9)$$

DCT has a very good energy compaction property, transforming most of the
signal energy to the first few coefficients. Therefore only the first few (typically 12)
coefficients are taken to form the feature vector, dropping the rest.
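
The four blocks above can be put together in a short sketch. Everything concrete here is an assumption made for illustration: the Hamming window, the naive DFT (used instead of an FFT so the example needs no external library), the 8 kHz sampling rate, the four rectangular filters and the three cepstral coefficients. A real front end would use an FFT and a perceptually motivated filter-bank.

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Block 1: split the signal into overlapping frames."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame):
    """Block 1: taper the frame edges to reduce boundary discontinuities."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

def power_spectrum(frame, n_fft):
    """Block 2: zero-pad to n_fft and return |DFT|^2 for bins 0..n_fft/2."""
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    return [abs(sum(x * cmath.exp(-2j * math.pi * f * n / n_fft)
                    for n, x in enumerate(padded))) ** 2
            for f in range(n_fft // 2 + 1)]

def filterbank_energies(power, filters):
    """Block 3 (Equation 7): weighted sum of the power spectrum per filter."""
    return [sum(x * w for x, w in zip(power, weights)) for weights in filters]

def cepstrum(energies, n_coeffs):
    """Block 4 (Equation 9, zero-based indices): DCT of the log energies."""
    n = len(energies)
    logs = [math.log(max(e, 1e-12)) for e in energies]  # floor avoids log(0)
    return [sum(logs[m] * math.cos(math.pi * (2 * m + 1) * k / (2 * n))
                for m in range(n))
            for k in range(n_coeffs)]

# Toy input: a 1500 Hz tone sampled at 8 kHz; 20 ms frames with 50% overlap.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1500 * t / sample_rate) for t in range(400)]
n_fft = 256                      # 160-sample frames are zero-padded to 256
n_bins = n_fft // 2 + 1
edges = [0, 32, 64, 96, n_bins]  # four hypothetical rectangular bands
filters = [[1.0 if lo <= i < hi else 0.0 for i in range(n_bins)]
           for lo, hi in zip(edges, edges[1:])]

band_energies, features = [], []
for frame in frames(signal, frame_len=160, hop=80):
    power = power_spectrum(hamming(frame), n_fft)
    energies = filterbank_energies(power, filters)
    band_energies.append(energies)
    features.append(cepstrum(energies, n_coeffs=3))
```

The tone falls in DFT bin 48, so the second band (bins 32 to 63) carries most of the energy in every frame. Changing `filters` changes the resulting feature space, which is the handle used for optimization later in this thesis.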

Filter-Bank 

Here we see that the filter-bank is a crucial parameter in the transformation of
the signal to the feature space. Filter-banks are generally designed using
knowledge of the human auditory system. The human auditory system processes
the signal in various frequency bands, with a linear distribution in the initial part of
the frequency range that becomes nonlinear towards the end of the frequency
range. To model such a system, two popular frequency scales, namely Mel and Bark,
have been suggested. The Mel scale filter-bank was introduced by Davis and
Mermelstein[7] in 1980, when they combined perceptually placed triangular filters with
log-compressed filter output energies. The original scale contained 20 filters: the first
10 linearly placed between 100 Hz and 1000 Hz, the next 5 log-spaced between 1 kHz
and 2 kHz, and 5 log-spaced between 2 kHz and 4 kHz. Since then, other researchers
have tried to provide modifications to the original filter-bank, out of which the scale
function (Equation 10) provided by Fant[10] is used in most MFCC systems.

$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (10)$$

On this scale, filters are equally spaced, with each bandwidth determined by the
frequency range $(f_{min}, f_{max})$ and the number of filters N. Thus,

$$\Delta f = \frac{f_{max} - f_{min}}{N + 1}$$

$$f_{c_i} = f_{min} + i \Delta f \qquad (11)$$

where $f_{c_i}$ is the center frequency of the $i$-th filter. In a triangular filter-bank, the
left ($f_{l_i}$) and right ($f_{r_i}$) frequencies are the center frequencies of the previous and
next filter respectively. They are defined as shown in the following equations.

$$f_{l_i} = \begin{cases} f_{min} & i = 1 \\ f_{c_{i-1}} & 1 < i \le N \end{cases} \qquad (12)$$

$$f_{r_i} = \begin{cases} f_{c_{i+1}} & 1 \le i < N \\ f_{max} & i = N \end{cases} \qquad (13)$$

Figure 2.6: Mel frequency scale approximation proposed by Fant
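
Equations 10 to 13 can be turned into a small filter-bank constructor, sketched below. The function names, the inverse mapping `mel_to_hz`, and the choice of 20 filters over 0 to 4000 Hz are assumptions for illustration; the spacing is computed on the Mel scale and the resulting frequencies are mapped back to Hz.

```python
import math

def hz_to_mel(f):
    """Equation 10."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of Equation 10."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(f_min, f_max, n_filters):
    """Return (left, center, right) in Hz for each triangular filter."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)                  # Equation 11
    centers = [m_min + i * step for i in range(1, n_filters + 1)]
    edges = []
    for i, center in enumerate(centers):
        left = m_min if i == 0 else centers[i - 1]            # Equation 12
        right = m_max if i == n_filters - 1 else centers[i + 1]  # Equation 13
        edges.append(tuple(mel_to_hz(v) for v in (left, center, right)))
    return edges

def triangular_weights(left, center, right, n_bins, sample_rate):
    """Weight vector over n_bins spectrum points spanning 0..sample_rate/2 Hz."""
    weights = []
    for i in range(n_bins):
        f = i * (sample_rate / 2) / (n_bins - 1)
        if left < f <= center:
            weights.append((f - left) / (center - left))      # rising edge
        elif center < f < right:
            weights.append((right - f) / (right - center))    # falling edge
        else:
            weights.append(0.0)
    return weights

edges = mel_filter_edges(0.0, 4000.0, n_filters=20)
bank = [triangular_weights(l, c, r, n_bins=129, sample_rate=8000)
        for l, c, r in edges]
```

Each row of `bank` is one weight vector in the sense of Equation 6. Moving the boundaries in `edges` yields a different filter-bank and therefore a different feature extractor.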


2.3 Classification 


2.3.1 Acoustic Modeling 

In this subsystem, the connection between the acoustic information and phonetics is
established. The connection can be established either at the word or at the phoneme
level, giving rise to word based systems and phoneme based systems, respectively. In
both cases a speech unit p is mapped to its acoustic counterpart using temporal models,
as speech is a temporal signal. There are many models for this purpose, such as:

• Hidden Markov Model (HMM)

• Artificial Neural Network (ANN)

• Dynamic Bayesian Network (DBN)

Figure 2.7: A three state left-to-right HMM

ANN is a general pattern recognition model which found its use in ASR in the
early years. Rabiner[18], in 1991, first suggested the HMM approach, leading to
substantial performance improvement. Current major ASR systems use HMMs for
acoustic modeling. Since then, researchers have tried to optimize this model for
memory and computation requirements. In the current state, it seems that the HMM
has given the best it could and we now need to find other models to go ahead in
this domain. This has led to the consideration of other models, among which the
Dynamic Bayesian Network seems a promising direction[27]. As our experiments and
verification have been done on an HMM based system, we take a detailed look at the HMM.

Hidden Markov Model The basic idea of an HMM is that the observation sequence
X is generated by a system which exists in one of a finite number of states. At
each time step, the system makes a transition from the current state to the next
state while emitting an observable quantity according to a state specific probability
distribution. Precisely speaking, a hidden Markov model M is a four tuple consisting
of the following.

1. A set of possible states Q

2. A transition matrix A where $a_{ij}$ is the probability of making a transition from
state $q_i$ to state $q_j$

3. A state conditioned probability distribution over observations, that is, a
specification for $\Pr(x/q_i)$ for any observation x, parameterized as B

4. An initial state probability $\pi$

The observation sequence modeled by the HMM may be discrete or continuous in 
nature, but the state space can only be discrete. This model has a very well studied 
mathematical background of Markov Processes. Basically, an HMM is a first order 
Markov process with state emitting observations. Figure 2.7 shows a three state 
HMM with all the transitions either going forward or loop back to itself. This 
kind of HMM is called a left-to-right HMM which can naturally model the speech 
signal. There are various kinds of HMM topologies which impose restrictions on the 
transitions allowed between states. 

There are three basic problems related to HMMs, all of them important to speech
recognition. The algorithms that solve these problems are the important places for
improvements, as they are at the core of HMM based speech recognition.

1. The Evaluation Problem

Given an HMM M and an observation sequence $X = x_1, x_2, x_3, \ldots, x_T$, what is
the probability that the observations are generated by the model, $\Pr(X/M)$?

2. The Decoding Problem

Given an HMM M and an observation sequence $X = x_1, x_2, x_3, \ldots, x_T$, what
is the most likely state sequence that produced the observations?

3. The Learning Problem

Given an HMM M and an observation sequence $X = x_1, x_2, x_3, \ldots, x_T$, how
should we adjust the model parameters to maximize $\Pr(X/M)$?

These problems are addressed by the Forward Algorithm, the Viterbi
Algorithm and the Baum-Welch Algorithm, respectively. We will not go into the
details of these algorithms. A detailed discussion is available in [18].
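
For orientation only, the first two problems can be sketched for a discrete-observation HMM as below. The three-state left-to-right model and all of its probabilities are made-up values; the code works with plain probabilities, whereas practical systems use log-probabilities or scaling to avoid underflow on long sequences.

```python
def forward(pi, A, B, obs):
    """Evaluation: Pr(X/M), summed over all state sequences."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for x in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][x]
                 for j in range(n)]
    return sum(alpha)

def viterbi(pi, A, B, obs):
    """Decoding: the most likely state sequence and its probability."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for x in obs[1:]:
        # Best predecessor of each state j under the current scores.
        prev = [max(range(n), key=lambda i: delta[i] * A[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][x] for j in range(n)]
        backpointers.append(prev)
    state = max(range(n), key=lambda i: delta[i])
    best_prob = delta[state]
    path = [state]
    for prev in reversed(backpointers):   # trace the best path backwards
        state = prev[state]
        path.append(state)
    return path[::-1], best_prob

# Hypothetical three-state left-to-right HMM over two observation symbols.
pi = [1.0, 0.0, 0.0]
A = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
B = [[0.9, 0.1],
     [0.1, 0.9],
     [0.9, 0.1]]
obs = [0, 0, 1, 1, 0]

likelihood = forward(pi, A, B, obs)
path, path_prob = viterbi(pi, A, B, obs)
```

Here the decoded path stays in the first state for two frames, the second for two frames, and ends in the third. The forward likelihood is at least as large as the probability of that single best path, since it sums over all paths.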

To see how this can be applied to ASR, we will study a whole-word
isolated word recognizer. In this system, there is an HMM $M_i$ for each word $w_i$ in the
dictionary D. HMM $M_i$ is trained with the speech samples of word $w_i$ using the
Baum-Welch algorithm. This completes the training part of the ASR. At the time
of testing, the unknown observation sequence X is scored against each of the models
using the forward algorithm, and the word corresponding to the highest scoring model
is given as the recognized word.

2.3.2 Language Modeling 

The goal of language modeling is to produce an accurate value of $\Pr(W)$. A language
model contains the structural constraints available in the language to generate the
probabilities. Intuitively speaking, it determines the probability of a word occurring
after a word sequence. It is easy to see that each language has its own constraints
for validity. The method and complexity of modeling a language vary with the
speech application. For example, a simple speech enabled call dialing system, which
has a very limited vocabulary and constrained input, will have a simple language
model. On the other hand, the task of transcribing broadcast news data
requires a large vocabulary of the order of thousands of words, with a sentence
structure that is much less constrained. This leads to two main approaches for language
modeling, as described below. The appropriateness of the approach is problem specific.
Generally, small vocabulary constrained tasks like phone dialing can be modeled by
a grammar based approach, whereas large applications like broadcast news
transcription require a stochastic approach.

Figure 2.8: Search graph generated by grammar


Grammar Based Models 

In this approach, the structure of the language is defined in terms of grammar rules.
Generally, this grammar is a context free grammar. From the grammar, we
can generate a search graph where the nodes are the words in the dictionary.
For example, the search graph generated by the grammar shown below is shown in
Figure 2.8. A path from <sent-start> to <sent-end> corresponds
to a valid hypothesis.

SENTENCE = <sent-start> NAME <sent-end>

NAME = madhav | nishit | nityanand
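
The example grammar can be expanded into its valid paths with a few lines of code. The dictionary representation of the rules and the `expand` helper are assumptions made for this sketch; it handles only non-recursive grammars, which is enough for the example.

```python
from itertools import product

# The grammar above: each non-terminal maps to a list of alternatives.
grammar = {
    "SENTENCE": [["<sent-start>", "NAME", "<sent-end>"]],
    "NAME": [["madhav"], ["nishit"], ["nityanand"]],
}

def expand(symbol, grammar):
    """Return every word sequence derivable from symbol (no recursion allowed)."""
    if symbol not in grammar:          # terminal: a word or sentence marker
        return [[symbol]]
    sequences = []
    for alternative in grammar[symbol]:
        parts = [expand(s, grammar) for s in alternative]
        for combo in product(*parts):  # one choice per symbol in the rule
            sequences.append([word for part in combo for word in part])
    return sequences

paths = expand("SENTENCE", grammar)
```

Each element of `paths` is one route from <sent-start> to <sent-end> through the search graph, so this grammar admits exactly three hypotheses.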


Statistical Language Models

Here, $\Pr(W)$ is determined using the chain rule as shown in Equation 14.

$$\Pr(W) = \prod_{i=1}^{n} \Pr(w_i / w_{i-1}, w_{i-2}, \ldots, w_0) \qquad (14)$$

Here, the probability of a word $w_i$ occurring in a sentence is dependent on all the past
words. It is apparent that this would require a large number of probabilities
to be estimated: the number of probabilities grows exponentially with the
length of the word history. Therefore an approximation to this method is required.
This can be achieved by truncating the past dependencies, which gives rise to
Equation (15).

$$\Pr(W) \approx \prod_{i=1}^{n} \Pr(w_i / w_{i-1}, w_{i-2}, \ldots, w_{i-N+1}) \qquad (15)$$

Commonly used values for N are trigram N = 3, bigram N = 2 and unigram
N = 1. Such language models need a huge amount of data from the corresponding
language to estimate the probabilities. It is easy to see that as we increase the number
N, the data requirement increases exponentially. A trivial method to get these
probabilities from the data is to obtain word frequencies. As the amount of data
increases, the approximation in estimating the probabilities tends to the actual values.
The probability information for a trigram model can be estimated using the equation
below.

$$\Pr(w_i | w_{i-2}, w_{i-1}) = f(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1}) \qquad (16)$$

where $C(\cdot)$ represents the number of occurrences of the corresponding word sequence.
There is a major problem in using raw frequencies as probabilities. Any word
combination which is not present in the training corpus would be assigned a probability
value of zero. Therefore it is necessary to smooth the trigram frequencies using the
following equation.

$$\Pr(w_i | w_{i-2}, w_{i-1}) = \lambda_3 f(w_i | w_{i-2}, w_{i-1}) + \lambda_2 f(w_i | w_{i-1}) + \lambda_1 f(w_i) \qquad (17)$$

where the nonnegative weights satisfy $\lambda_1 + \lambda_2 + \lambda_3 = 1$. The other approach
used to alleviate this problem is clustering the words into different classes. The
probability estimation is then done on classes instead of words. The actual word
probabilities are then derived from the class probabilities.
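
The frequency estimate and its interpolated smoothing can be sketched as follows. The three-sentence corpus and the weights (0.1, 0.3, 0.6) are invented for the example; in practice the weights would be tuned on held-out data.

```python
from collections import Counter

def train_counts(sentences):
    """Count unigrams, bigrams and trigrams in a tokenized corpus."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        for i, w in enumerate(words):
            uni[w] += 1
            if i >= 1:
                bi[(words[i - 1], w)] += 1
            if i >= 2:
                tri[(words[i - 2], words[i - 1], w)] += 1
    return uni, bi, tri

def smoothed_trigram(w, w2, w1, counts, lambdas=(0.1, 0.3, 0.6)):
    """Interpolated trigram probability of w given the history (w2, w1).

    lambdas = (lambda1, lambda2, lambda3) weight the unigram, bigram and
    trigram relative frequencies and are assumed to sum to 1.
    """
    uni, bi, tri = counts
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    f1 = uni[w] / total if total else 0.0
    f2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    f3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    return l3 * f3 + l2 * f2 + l1 * f1

corpus = [["call", "home", "now"],
          ["call", "office", "now"],
          ["call", "home", "later"]]
counts = train_counts(corpus)

seen = smoothed_trigram("now", "call", "home", counts)
unseen = smoothed_trigram("later", "call", "office", counts)
```

The trigram "call office later" never occurs in the corpus, so its raw frequency is zero, yet the smoothed estimate stays positive because the unigram term still contributes.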

2.3.3 Decoding Algorithm 

As we can see from equation (3), finding the correct sentence is an optimization 
problem once $\Pr(W)$ and $\Pr(X \mid W)$ are known. The behavior of the ASR system 
depends heavily on the decoding algorithm, as it largely determines the response time 
of the system. The language model induces a search graph whose nodes are words and 
whose links represent the probabilistic relationships between them. Here we note that 
the computation of $\Pr(W)$ is independent of the observation sequence $X$, whereas 
$\Pr(X \mid W)$ has to be computed at run time using the model parameters. The decoding 
problem is to find the best path in the search graph. The algorithm therefore has to 
maintain a set of active paths (paths which can possibly reach the sentence end) and 
keep expanding them until the end of speech is reached. After that, the hypothesis 
corresponding to the path having the highest probability is given as the recognized 
hypothesis. A path contains a sequence of states of the graph. One way to search is 
to keep paths of the same length, that is, each path is scored against the last 
speech frame obtained. This is called Time Synchronous decoding. The most widely 
used Time Synchronous decoding method is the Viterbi algorithm. The other way is to 
keep expanding the path with the highest probability. When the probability of the best 
path drops, we need to select another path in the list and apply the previous frames, 
which were read and buffered, to reach the current frame. This is called Time 
Asynchronous decoding. Stack-based decoding algorithms are modifications of the 
A* algorithm and fall into the Time Asynchronous category. 
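
As an illustration of Time Synchronous decoding, the following is a minimal Viterbi sketch over a generic state graph. It is not the decoder of any particular recognizer: the function name, the log-probability tables and the two-state toy example are assumptions for illustration. All paths are extended by exactly one frame per step, which is what makes the search time synchronous.

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Return (best state path, its log score).

    obs_loglik[t][s]: log-likelihood of frame t under state s
    log_trans[p][s]:  log transition probability from state p to state s
    log_init[s]:      log initial probability of state s
    """
    n_states = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(obs_loglik)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s when scored against frame t.
            best = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            ptr.append(best)
            new_score.append(score[best] + log_trans[best][s] + obs_loglik[t][s])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]

lg = math.log
obs = [[lg(0.9), lg(0.1)], [lg(0.9), lg(0.1)], [lg(0.1), lg(0.9)]]
trans = [[lg(0.5), lg(0.5)], [lg(0.5), lg(0.5)]]
init = [lg(0.5), lg(0.5)]
best_path, best_score = viterbi(obs, trans, init)
```

A practical decoder would additionally prune the active paths (beam search), which this sketch omits.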

This chapter provided a brief overview of the speech recognition process and the tools 
used for that purpose, with emphasis on Feature Extraction. An exhaustive survey 
of speech recognition methods and their evolution is provided in [13, 19]. 
In the next chapter, we introduce the basic concepts used in this thesis and discuss 
previous work. 

Chapter 3 

Previous Work and Preliminaries 


3.1 Optimization of Features for Pattern Recognition 

Optimization of feature extraction systems to obtain improvements in classification 
accuracy is a well-studied field in pattern recognition. Traditionally, the features 
for a pattern recognition task are defined using expert knowledge. This approach to 
designing a feature extractor has been successful for some pattern recognition tasks. 
However, as the problem becomes more complex, the design of features becomes more 
difficult because of the increase in dimensionality and classification complexity. 
Let us take the example task of designing a feature extractor for iris recognition, 
that is, classifying a type of flower into two classes, say A and B. The trivial 
features that come up are color, length of the petal, blooming period, and number of 
petals. A simple analysis of the spread of the samples of the two classes with 
respect to different features would yield the feature set required. The set of 
features for which the classes show up well clustered and non-overlapping can become 
an ideal feature set. From this recognition task, let us go to the task of recognizing 
a human face. Here also there would be many candidate features, but the 
dimensionality of such a task makes it intractable for manual analysis. 

Speech recognition is one such problem where knowledge of the human auditory system 
has been utilized in designing the feature set. Processing of speech through 
various nonlinear frequency bands by the auditory system gave rise to nonlinear 
frequency band energy based feature sets using the Bark and mel scales. Various 
researchers like Fant [10] and Slaney [22] have suggested modifications of these 
scales, reporting improvements in recognition accuracy. These features enabled speech 
recognizers to transcribe speech with high accuracy in constrained conditions. But 
these systems degrade drastically with changes in speaker characteristics and 
environmental conditions. The most probable reason for such behavior is drastic 
variation in the features extracted under the changed conditions. This situation 
calls for features which are robust to these changes. 

This problem motivates the use of labeled data to obtain optimal features. There 
are several ways to use the labeled data in pattern recognition in general and speech 
recognition in particular. We briefly discuss some approaches here. 

Artificial Neural Network (ANN) 

ANNs have proved their ability to solve difficult classification problems. Their 
ability to adapt through a robust training algorithm makes them one of the obvious 
methods to obtain data dependency in pattern recognition. The ANN/HMM hybrid 
approach for speech recognition is an example where an ANN is used as a feature 
extractor. In this approach, the training task consists of learning the HMM 
parameters through the Baum-Welch algorithm and simultaneously updating the neural 
network weights to generate better features. A state-of-the-art survey of this 
approach is provided by Edmondo Trentin et al. [24]. 

Transformation of Feature Vectors 

In this approach, the features are first extracted using the conventional feature 
extraction system and then a transformation is applied to obtain enhanced features. 
The transformation $\rho$ is generally from a high dimensional space $R^n$ to a lower 
dimensional space $R^p$, maintaining the discriminative information while reducing 
redundancy. 

$$\hat{O}(t) = \rho\, O(t) \qquad (1)$$

In the above equation, an $n$-dimensional feature vector $O(t)$ is transformed into a 
$p$-dimensional vector $\hat{O}(t)$ using a $p \times n$ dimensional transformation 
matrix $\rho$. One basic method for linear transformation is to apply Principal 
Component Analysis (PCA) [8], which is an unsupervised method as it does not require 
class information. This method finds an orthogonal coordinate system in which the axes 
are ordered according to decreasing variance. The first few coordinates take up most 
of the variance present in the data. Hence projection of the original data onto the 
newly generated truncated coordinate system yields dimensionality reduction. Linear 
Discriminant Analysis (LDA) is one standard method in pattern recognition which takes 
the class information into account. Here the transformation $\rho$ can be found using 
any class scatter metric as an optimization criterion. Xunying Liu [14] has used 
linear transformation based optimization using modified LDA. 

Parameterization and Optimization of Feature Extractor 

In this approach, the feature extraction system is parameterized and then 
optimization is performed to obtain the parameters. A formalism for such an approach 
was first provided by Biem et al. [2][3], who used it to obtain features for the 
speech recognition task. This paradigm assumes that the feature extractor and the 
recognizer are parameterized by $\theta$ and $\rho$ respectively. During the training 
phase, both sets of parameters are learned together. This method is applied to small 
speech recognition tasks with a filter-bank based feature extractor, treating the 
filter-bank parameters as the parameters to optimize. The aim of the experiment is to 
obtain a better filter-bank which can then be used for a larger recognition task. 
The method used for optimization is generally gradient based, using an objective 
function which tends to simulate the recognition system. The method is mostly 
applied to filter-bank based feature extractors because of their widespread use in 
speech recognition systems. 

Biem [3] employed vowel classification as a recognition task. The center fragment 
of each vowel was used to generate 256 FFT-based spectral coefficients through a 
Hamming window. A neural network was used as the classification mechanism. The 
classification output obtained from the neural network was used to compute the 
objective function used in the gradient descent search. 


Hermansky et al. [15] use LDA to optimize the filter-bank. Here the within-class 
and between-class covariance is used as the class scatter metric. They have reported 
improvement in recognition accuracy through optimized features. They also 
provide an analysis of the shape of the filters obtained, which could lead to an 
understanding of the type of spectral changes that carry phonetic information. 

The bandwidths of the filters of the filter-bank also have an effect on the 
discriminative ability of the features, which can be concluded from Mark's work [21]. 
In his work, the bandwidth of the mel filter-bank is modified, giving new features 
called HFCC. Here also a within-class and between-class measure is used, with 
phonemes as classes. One of the problems here was the truncation of the phonemes 
at the boundaries to make the length of all the phonemes equal. This loss of length 
information can lead to improper optimization, as length is also a critical 
parameter used by the human perception system. Therefore a mechanism to generate a 
constant length representation from variable length data is required. Varying the 
filter parameters also has an effect on the robustness of the system to noise. This 
can be concluded from the work of Torre [6]. 

The previous work reviewed above suggests that more exploration in filter-bank 
based feature extraction can lead to improvements in speech recognition systems and 
help us create a system which can work in an unconstrained environment. In our 
work, we try to optimize the filter-bank using a different optimization mechanism 
and objective function. All the previous work uses gradient based optimization, 
which puts constraints on the definition of the objective function. We use an 
evolutionary approach, Genetic Algorithms, which does not require the objective 
function to be differentiable. The objective function is defined using criteria 
similar to within-class and between-class variances with phonemes used as classes, 
but no truncation is performed. The features obtained for phoneme samples are 
transformed into constant length features using time series formulation methods. 

3.2 Genetic Algorithm as an Optimization Method 

In this section, we briefly review the evolutionary technique of genetic algorithms 
(GA). GA is a search and optimization method which mimics natural selection and 
chromosomal processing in natural genetics. These algorithms try to mimic the 
natural process of evolution to solve an optimization problem. In some optimization 
problems, the search space is astronomically large, prohibiting a brute force 
approach to finding an optimal solution. Many problems that involve optimizing 
real-valued feature extractor parameters lead to search spaces that are infinite. 
Classical optimization methods such as gradient based optimization have been used to 
solve such problems. One major drawback of classical methods is the requirement of a 
differentiable objective function. Some problems have a non-differentiable objective 
function, which makes classical methods inapplicable. 

GA simulates the survival of the fittest among individuals over consecutive 
generations for solving a problem. Each generation consists of a population of 
character strings that are analogous to chromosomes. Each individual represents a 
point in the search space and a possible solution. The individuals in the population 
are then made to go through a process of evolution. 

GA is based on an analogy with the genetic evolution. The basic principles are: 

• Individuals in a population compete for resources and mates 

• Those individuals most successful in each ‘competition’ will produce more 
offspring than those individuals that perform poorly 

• Genes from ‘good’ individuals propagate throughout the population so that 
two good parents will sometimes produce offspring that are better than either 
parent 

• Thus each successive generation will become more suited to its environment 

Figure 3.1: Genetic Algorithm 


Search Space 

A population of individuals is maintained within a search space for a GA, each 
representing a possible solution to a given problem. Each individual is coded as 
a finite length vector of components, or variables, in terms of some alphabet. To 
continue the genetic analogy these individuals are likened to chromosomes and the 
variables are analogous to genes. Thus a chromosome (solution) is composed of 
several genes (variables). A fitness score is assigned to each solution representing the 
abilities of an individual to ‘compete’. The individual with the optimal (or generally 
near optimal) fitness score is sought. The GA aims to use selective ‘breeding’ of the 
solutions to produce ‘offspring’ better than the parents by combining information 
from the chromosomes. 


The GA maintains a population of n chromosomes (solutions) with associated 
fitness values. Parents are selected to mate on the basis of their fitness, 
producing offspring via a reproductive plan. This mating plan consists of the 
sequence of genetic operators to be applied to the current population to generate the 
next generation. Consequently, highly fit solutions are given more opportunities to 
reproduce, and offspring inherit characteristics from each parent. As parents mate 
and produce offspring, room must be made for the new arrivals since the population 
is kept at a static size. Individuals in the population die and are replaced by the 
new ones, eventually creating a new generation once all mating opportunities in the 
old population have been exhausted. In this way it is hoped that over successive 
generations better solutions will thrive while the least fit solutions die out. 

New generations of solutions are produced containing, on average, better genes 
than a typical solution in a previous generation. Each successive generation will 
contain better ‘partial solutions’ than previous generations. Eventually, once the 
population has converged and is not producing offspring noticeably different from 
those in the previous generations, the algorithm itself is said to have converged to a 
set of solutions to the problem at hand. This entire process is shown in Figure 3.1. 

3.2.1 Genetic Operators 

Here, we describe the function of each operator. 

Selection 

This operator is used to select the parent elements from the current population 
pool. The key property of this operator is the ability to give more preference to 
better individuals, allowing them to pass better genes to the next generation. Here 
the goodness of each individual depends on the fitness assigned by the objective 
function. A trivial selection operator would select the best N individuals for 
reproduction. But this may not be the best strategy, as weak individuals might also 
possess some good genes that, if overlooked, would never come into the best solution 
set, leading to a sub-optimal solution. 

Figure 3.2: Crossover operation 


Crossover 

This is an operator which induces diversity in the population. Two individuals 
from the current population (generation t) are chosen using the selection operator 
and a random point of crossover is generated. The portion of the chromosome up to the 
randomly generated point is then swapped between the two individuals, giving rise to 
two new individuals. These newly generated individuals are put into the next 
generation pool as shown in Figure 3.2. The crossover between parent individuals is 
performed with probability $p_c$, which is also called the crossover rate. That is, in 
$100 p_c$% of the crossover operations, new offspring are put in the next generation. 
The rest of the time, the parent individuals are passed on directly without any 
change. Intuitively, we are recombining portions of the chromosomes of good 
individuals, which is likely to produce even better individuals. 

Mutation 

The function of this operator is to randomly change part of the chromosome with 
low probability (the mutation rate, $p_m$). The purpose of this operator is to maintain 
the diversity of the population and to avoid premature convergence. This operator 
alone would induce a random walk through the search space. The implementation 
of this operator depends heavily on the kind of chromosome being evolved. For 
example, this operator may simply flip some bits of a binary coded GA, while for 
chromosomes having real elements the task is to perturb some of the values using 
random numbers. 

There are two types of GA in use. They differ in how the new population is generated 
from the previous one. One option is to generate new offspring and put them into the 
next generation without considering the parent generation, which is done in Simple 
GA. The other way of generating the next population is to merge the parent and child 
populations and then select the required number of individuals from the entire pool, 
which is done in SteadyState GA. The advantage of this method is that the ‘good’ 
individuals from the parent generation which failed to have an effect in the current 
process can remain in the pool for the next generation. 
In our implementation, we use SteadyState GA. 
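
The operators reviewed in this section can be combined into a compact sketch of the merge-and-select scheme described above. This is only an illustration under stated assumptions: the function names are hypothetical, binary tournament selection is an assumed choice (the text does not fix a selection scheme), and the bit-string problem at the end is a toy stand-in for the real objective.

```python
import random

def steady_state_ga(init_pop, fitness, crossover, mutate,
                    generations=50, p_c=0.8, p_m=0.1, rng=None):
    """Evolve a population; parents and children are merged and the best kept."""
    rng = rng or random.Random(0)
    pop = list(init_pop)
    size = len(pop)

    def select():
        # Binary tournament (assumed): favours fitter individuals while
        # still giving weaker ones a chance to reproduce.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = []
        while len(children) < size:
            p1, p2 = select(), select()
            if rng.random() < p_c:            # crossover with rate p_c
                c1, c2 = crossover(p1, p2, rng)
            else:                             # otherwise parents pass unchanged
                c1, c2 = p1, p2
            for c in (c1, c2):                # mutation with rate p_m
                children.append(mutate(c, rng) if rng.random() < p_m else c)
        # Merge parent and child populations, keep the `size` fittest.
        pop = sorted(pop + children, key=fitness, reverse=True)[:size]
    return pop

# Toy problem (assumption): maximise the number of ones in a bit list.
def one_point(a, b, rng):
    r = rng.randint(1, len(a) - 1)
    return a[:r] + b[r:], b[:r] + a[r:]

def flip_one_bit(c, rng):
    i = rng.randrange(len(c))
    return c[:i] + [1 - c[i]] + c[i + 1:]
```

Because the parents compete with their children for a place in the next population, the best fitness in the pool can never decrease from one generation to the next.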



Chapter 4 

Design of GA Based Feature 
Extraction Optimization System 


In this chapter, we will describe the design of a genetic algorithm based feature 
optimization system. The design of such a framework can be broken into two major 
aspects. 

1. Optimization Method 

2. Objective Function 

Any optimization method searches for the best solution in the search space induced 
by the parameter set. In the feature extraction process, the feature extraction 
method would define a search space. A filter-bank based feature extractor would 
induce a set of feature spaces each corresponding to a specific filter-bank. As the 
filter-bank contains real parameters, the set is infinite. 

As described in the previous chapter, the three most important aspects of using 
genetic algorithms are: 

1. Definition of the Genetic Representation 

2. Definition of the Genetic Operators 

3. Definition of the Objective Function 

Defining the above three aspects completely defines the optimization method. Before 
each aspect is discussed in detail, an abstract classification task is defined. 

Classification Task 

A recognition task consists of $M$ classes $\Omega^{(1)}, \ldots, \Omega^{(i)}, \ldots, 
\Omega^{(M)}$. $N_i$ represents the number of samples present for class $\Omega^{(i)}$. 
Let $x_{i,j}$ be the $j$-th sample of class $\Omega^{(i)}$. The task is to classify an 
unknown sample into one of the $M$ classes. The class definition can be any one of 
phoneme, syllable or word. The feature set optimization process searches for the most 
discriminative feature set with respect to these classes ($\Omega^{(i)}$) based on the 
samples provided. 

4.1 Genetic Representation 

In filter-bank based feature extraction, the filter bank is a set of filters extracting 
the energy of the signal in the corresponding frequency band. Therefore, the induced 
feature space is highly dependent on the set of filters. 
A chromosome is defined as a sequence of such triangular filters. A triangular filter 
is represented using three frequencies: 

• Left Frequency - $\alpha$ 

• Center Frequency - $\beta$ 

• Right Frequency - $\gamma$ 

As noted earlier, a filter-bank is applied on a discrete frequency domain, so we can 
think of the continuous frequency domain as being partitioned into a finite number 
of bins. Hence the edge-frequencies of the filters are specified in terms of bin 
number instead of absolute frequency. For example, if the frequency domain is 
represented by 512 points, (20, 45, 50) is a valid tuple representing a filter. 

A filter bank is a sequence of such filters. Let us say the number of filters is $N$. 
Hence, in our case, the parameter set is $FB = \{F_i \mid i = 1, \ldots, N\}$ where $F_i$ 
is a 3-tuple $(\alpha_i, \beta_i, \gamma_i)$. A filter represents one element of a 
chromosome. 
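
A minimal sketch of this representation follows. The type aliases and function names are hypothetical, and the linear rise and fall of the triangle is the usual triangular-filter shape assumed here rather than a detail taken from the text.

```python
from typing import List, Tuple

Filter = Tuple[int, int, int]     # (alpha, beta, gamma) given as bin numbers
FilterBank = List[Filter]         # one chromosome: a sequence of N filters

def is_valid_filter(f: Filter, n_bins: int) -> bool:
    """Left edge < center < right edge, all inside the discrete frequency axis."""
    alpha, beta, gamma = f
    return 0 <= alpha < beta < gamma < n_bins

def triangular_weights(f: Filter, n_bins: int) -> List[float]:
    """Weights rising linearly from alpha to beta and falling from beta to gamma."""
    alpha, beta, gamma = f
    w = [0.0] * n_bins
    for k in range(alpha, gamma + 1):
        if k <= beta:
            w[k] = (k - alpha) / (beta - alpha)
        else:
            w[k] = (gamma - k) / (gamma - beta)
    return w

def filter_bank_energies(power_spectrum: List[float], bank: FilterBank) -> List[float]:
    """Energy collected by each filter of the bank from one power spectrum."""
    n = len(power_spectrum)
    return [sum(wk * pk for wk, pk in zip(triangular_weights(f, n), power_spectrum))
            for f in bank]

example = (20, 45, 50)            # the tuple used as an example in the text
weights = triangular_weights(example, 512)
```

Storing the edges as bin numbers keeps every gene an integer, so the genetic operators can work directly on the tuples.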
Figure 4.1: A filter bank 


4.2 Genetic Operators 

4.2.1 Initialization 

The initialization of the population can be done in different ways. One way is to 
initialize all the individuals of the population by randomly placing the triangular 
filters on the frequency axis. The other way is to place filter-banks in the region 
near known filter-banks such as the Mel/Bark scale. The idea is to find the optimal 
filter bank, which could be close to the Mel/Bark scale in solution space or might 
be some other filter bank. Hence, we initialize the filter banks using a randomly 
perturbed Mel/Bark scale in our optimization experiments. Figure 4.1 shows a sample 
filter-bank generated from a Mel filter bank. 

To initialize the individuals of a population, first a mel-scale based filter-bank 
($FB_{mel}$) with $N$ filters is generated as described in Section 2.2. Then this 
filter-bank is perturbed as follows: 

1. randomly choose the number of filters to be modified 

2. randomly choose the filters to be modified 

3. change the edge-frequencies of each selected filter within a small neighborhood 

In the last step, each filter edge-frequency ($\alpha$) is changed to a random number in 
$[\alpha - md, \alpha + md]$, where $md$ is the maximum possible deviation. The order of 
the frequency edges (left edge < center edge < right edge) must be maintained in order 
to have a well formed filter-bank. Therefore, after the perturbation of a filter, this 
property is checked and, if it is not satisfied, the perturbation is performed again. 
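
The three perturbation steps can be sketched as follows. The function names are hypothetical, and two details are assumptions added for the sketch: the retry loop is capped at `max_tries` (falling back to the unchanged filter), and edges are also required to stay inside the frequency axis.

```python
import random

def perturb_filter(f, md, n_bins, rng, max_tries=100):
    """Move each edge of (alpha, beta, gamma) by at most md bins.

    The perturbation is repeated until left < center < right holds.
    """
    for _ in range(max_tries):
        cand = tuple(edge + rng.randint(-md, md) for edge in f)
        a, b, g = cand
        if 0 <= a < b < g < n_bins:
            return cand
    return f  # assumption: keep the original filter if no valid candidate is found

def perturb_filter_bank(bank, md, n_bins, rng):
    """Perturb a random subset of the filters of a filter-bank."""
    n_changed = rng.randint(1, len(bank))                    # step 1
    chosen = set(rng.sample(range(len(bank)), n_changed))    # step 2
    return [perturb_filter(f, md, n_bins, rng) if i in chosen else f
            for i, f in enumerate(bank)]                     # step 3

rng = random.Random(0)
bank = [(0, 5, 10), (5, 10, 20), (10, 20, 40)]
new_bank = perturb_filter_bank(bank, 3, 64, rng)
```

Calling `perturb_filter_bank` repeatedly on the same reference filter-bank yields an initial population scattered around it.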

4.2.2 Mutation 

The mutation of a filter bank is done by varying the filter frequencies randomly 
in a small neighborhood. This operator is implemented in the same way as the 
perturbation of a filter-bank described for the Initialization operation. 

4.2.3 Crossover 

The definition of this operator is not very different in this case. A random integer 
$r \in [1, N]$ is generated to obtain a crossover point. At this point, the filters of 
the two filter-banks are swapped to generate offspring. Let $FB_1$ and $FB_2$ be two 
individuals selected for mating. They are defined as follows, 

$$FB_1 = \{F_i^1 \mid i = 1, \ldots, N\}, \qquad FB_2 = \{F_i^2 \mid i = 1, \ldots, N\}$$

After applying the operator with a randomly selected crossover point $r$, the offspring 
generated would be, 

$$FB_1^{new} = \{F_1^1, \ldots, F_r^1, F_{r+1}^2, \ldots, F_N^2\}, \qquad FB_2^{new} = \{F_1^2, \ldots, F_r^2, F_{r+1}^1, \ldots, F_N^1\}$$
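
A minimal sketch of this operator on filter-banks stored as lists of 3-tuples is given below; the function name and the example filter-banks are hypothetical.

```python
import random

def crossover(fb1, fb2, rng):
    """Single-point crossover: swap the filters after a random point r in [1, N]."""
    if len(fb1) != len(fb2):
        raise ValueError("both filter-banks must have the same number of filters")
    r = rng.randint(1, len(fb1))      # r == N returns copies of the parents
    child1 = fb1[:r] + fb2[r:]
    child2 = fb2[:r] + fb1[r:]
    return child1, child2

parent1 = [(0, 2, 4), (2, 4, 8), (4, 8, 16), (8, 16, 32)]
parent2 = [(0, 3, 5), (3, 5, 9), (5, 9, 17), (9, 17, 31)]
child1, child2 = crossover(parent1, parent2, random.Random(0))
```

Since whole filters are exchanged, each filter in an offspring keeps its own ordered edges; only the combination of filters changes.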

It is interesting to note that the filter-bank based optimization problem has a 
very natural genetic representation and a simple implementation for the crossover 
operator. The objective function is the most important aspect, which is described 
next. 

Figure 4.2: Objective Function Evaluation 

4.3 Objective Function 

This function has to check the goodness of a given filter-bank. A trivial way of doing 
this is to use the filter bank in the feature extraction module of a complete speech 
recognition system and perform training and testing to obtain the final 
recognition accuracy. This would precisely indicate the goodness. The problem with 
this kind of evaluation lies in the computational requirements. The training 
task for even a small vocabulary system takes huge computational resources and 
considerable time. In the Genetic Algorithm paradigm, there would be thousands of 
such individuals to be evaluated over hundreds of generations. This makes 
such an evaluation infeasible to implement. 

To evaluate the goodness, researchers have used alternate methods which are less 
computation intensive. Let us check the properties of such a function. Let $O(\cdot)$ 
be the objective function and $Acc(\cdot)$ the accuracy of a particular recognition 
experiment for a feature extractor parameter set $\theta$. The ideal objective function 
should behave according to the following relation. 

$$O(\theta_1) > O(\theta_2) \Leftrightarrow Acc(\theta_1) > Acc(\theta_2) \qquad (1)$$

The standard method to define such an objective function is to use class separability 
criteria. Quantitative measures of class separability include Bayes risk, variational 
distances, scatter matrix based measures, Bhattacharyya distance and divergence 
rate [12]. Bayes risk is the best measure of separability of distributions, and 
for any feature space it gives the minimum attainable risk. Theoretically, 
Bayes error is the optimum measure of feature effectiveness, but its computational 
complexity restricts its use for measurement purposes. The computational requirement 
of a method is the most important aspect considered for selection, as 
the method is to be executed by a genetic algorithm for many generations. An 
elegant yet simple way of formulating a criterion for class separability is based on 
within-class and between-class scatter matrices, which are widely used in discriminant 
analysis in statistics [8]. This analysis is performed using the mean and covariance of 
the class clusters. One disadvantage of such measures is that they 
do not have a direct relationship to the probability of error of the Bayes classifier. 

The other criterion, the Bhattacharyya measure, is derived from the Chernoff bound (an 
upper bound on the Bayes error) [12]. Its computational requirement is also equivalent 
to that of the scatter matrix based measures. Hence, in our implementation, the 
Bhattacharyya distance has been used. This measure also uses the mean and covariance of 
the class clusters. 

To obtain the means and covariances of the samples belonging to a single class and 
to perform the between-class analysis, the samples should be represented in 
a space where each sample is a point. In the case of speech samples 
(phonemes/syllables/words), the length of the signal in signal space varies with 
speaker, speaker condition and context. Therefore the frame based feature sequence 
extracted from the signal also varies in length. If we analyze the samples in signal 
space for their length, we can find a range $[T_{min}, T_{max}]$ for their utterance 
lengths. There are two options to handle this problem. 

1. Truncation in Signal Space ($\Psi_s$) 

This is the standard approach applied in related work. The sample is truncated 
at the boundaries, keeping the central portion, to make its length constant across 
all samples. The shortcoming of the approach is the loss of information that may be 
crucial in a recognition experiment. 

2. Length Normalization in Feature Space ($\Psi_x$) 

Here, the features are obtained from samples of different lengths. This produces 
feature sequences with different lengths. At this point, normalization 
is applied, which transforms the samples into a constant length representation. 
We use this approach in our implementation. 

The procedure can be divided into three major operations as shown in the figure 
4.2. Each processing block will be described in the following subsections. 

4.3.1 Feature Extraction 

This processing block takes the labeled speech data and applies the feature extractor 
using the parameter set provided. The parameter set is generated by the 
genetic algorithm. The block is presented with a sample $s(n)$ of each class $C_i$, one 
at a time. Here we use the Frame-Based Filter Bank Feature Extractor explained in 
Chapter 2. The speech sample segment $s(n)$ therefore results in $\tau$ frames. Each 
frame yields a corresponding feature vector of length $F$. The feature vector 
sequence is described by 

$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,\tau} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,\tau} \\ \vdots & \vdots & & \vdots \\ x_{F,1} & x_{F,2} & \cdots & x_{F,\tau} \end{bmatrix}$$

The sequence can also be represented by collecting each vector element through 
time $t = 1$ to $t = \tau$, that is, $W_f = \{x_{f,t} \mid t = 1, \ldots, \tau\}$. Therefore 
the feature vector sequence would look like 

$$X = [W_1, W_2, \ldots, W_F]^T$$

where $W_f = \{w_{f,1}, \ldots, w_{f,t}, \ldots, w_{f,\tau}\}$ with $w_{f,t} = x_{f,t}$. 

4.3.2 Feature Vector Sequence Length Normalization 

The length normalization is a crucial operation which is required to maintain the 
properties of the signal while transforming the signal to a constant length 
representation. This problem is classically known as the Time Series Normalization 
Problem. Let us say $\Psi$ is the function that maps the sample from the feature space 
to the constant length representation. 

$$Y = \Psi(X) \qquad (2)$$

where $Y$ is a normalized representation of the given sample. 

The task of the function $\Psi$ is to represent the given sequence in a constant length 
representation retaining the properties of the input. Spectral methods are used for 
this purpose. The DFT is generally used to transform the signal into the frequency 
domain. For large time series data, dimensionality reduction is done through this 
method as described in [9, 1]. 

There, the dimensionality reduction is used to search time series databases. 
Recently, DWT based methods have been used for the same purpose because of 
their better performance [16]. In all these methods, the transformation is from a 
higher dimensional (but constant) space to a lower dimensional space. The time series 
in the input space have equal lengths and hence they are transformed into another 
space of fixed dimension. In our work, we have used another spectral method, the DCT, 
as the transformation method because of its energy compaction property. We also tried 
polynomial regression as a method for time series representation, but after some 
experimentation it was found that the DCT method outperforms polynomial regression. 


Discrete Cosine Transform (DCT) - $\Psi_{DCT}$ 


We briefly introduced this transformation in Chapter 2. Some of the nice properties 
of this transformation make it suitable for time series normalization. The transform 
is known for its ability to compact the energy of the signal into the first few 
coefficients. The image compression standard JPEG [26] is primarily based on this 
property. We use the compression ability of the DCT for time series 
normalization. The DCT transform equation is given below: 

$$y_k = \sum_{n=1}^{N} v_n \cos\frac{\pi (2n-1)(k-1)}{2N}, \qquad k = 1, \ldots, N$$

Basically, it provides a one-to-one mapping between the input sequence and the 
transformed sequence, which has the same length. The inverse transformation (IDCT) 
can be used to retrieve the original signal without information loss. Compression is 
achieved by dropping some of the coefficients of the transformed vector. Generally 
these coefficients contain very little energy and account for equally little 
information in the signal. Figure 4.3 shows, at the top, an original aperiodic signal 
of length 1258. A DCT is applied to this signal and only the first 200 
coefficients (approximately 1/6th) are kept, setting all others to zero. The signal 
at the bottom is the reconstructed one. 

Figure 4.4: Original and Reconstructed Signals 

Figure 4.4 shows a sequence that is very similar to those encountered in 
our normalization problem. A typical range of sequence length is 15-25 frames. The 
second and third signals are reconstructed from the first 15 and 10 DCT coefficients 
respectively. In our experiments, a normalization length of 15 is used. 
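
The normalization can be sketched as follows: each feature trajectory is transformed with a DCT and only its first few coefficients are kept. The function names are hypothetical, and two choices are assumptions made for this sketch: the orthonormal scaling of the DCT, and zero-padding of the coefficient vector when a sequence is shorter than the normalization length.

```python
import math

def dct(v):
    """Orthonormal DCT-II of a sequence (plain Python, O(N^2))."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def normalize_length(seq, c):
    """Keep the first c DCT coefficients; zero-pad if the sequence is shorter (assumed)."""
    coeffs = dct(seq)[:c]
    return coeffs + [0.0] * (c - len(coeffs))

def normalize_sample(x, c):
    """x holds F rows (one per feature element), each with tau frames.

    Returns a flat vector of F * c values, independent of tau.
    """
    return [y for row in x for y in normalize_length(row, c)]

short = [[float(t) for t in range(18)] for _ in range(3)]   # 3 features, 18 frames
long_ = [[float(t) for t in range(25)] for _ in range(3)]   # 3 features, 25 frames
```

With a normalization length of 15, both samples above map to vectors of 45 values, so they can be treated as points in the same space.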

Decorrelation Property 

This transform has another important property which is used to further simplify 
the computations. The elements of the transformed vectors are maximally 
decorrelated. That is, the correlation between the $i$-th and $j$-th elements 
($i \neq j$) of the output vector is close to zero. 

4.3.3 Class Separability Measurement 

After all the samples have been transformed into an $F \times C$ dimensional space, class 
scatter analysis is performed. In this space, the samples of each class can be thought 
of as forming a cluster. We are interested in finding how the clusters are spaced, that 
is, the coherence of each cluster and the overlap that exists between clusters. The 
variance of a class cluster gives us an idea of its scatter in the space, providing 
an estimate of class coherence. Using the mean of each class in connection with the 
variance, we can estimate the overlap that exists between classes. 

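Recalling the previous setup, there are M classes, the m-th of which has $N_m$
samples. Let the samples, after undergoing feature extraction and length
normalization, be represented by $x_{m,i}$. The mean and variance (covariance
matrix) of each class are computed as follows:

$$\mu_m = \frac{1}{N_m} \sum_{i=1}^{N_m} x_{m,i} \qquad (7)$$

$$\Sigma_m = \frac{1}{N_m} \sum_{i=1}^{N_m} (x_{m,i} - \mu_m)(x_{m,i} - \mu_m)^T \qquad (8)$$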


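For two classes $\Omega_i$ and $\Omega_j$, the Bhattacharyya measure is defined as,

$$D_B(i,j) = \frac{1}{8} (\mu_i - \mu_j)^T \left[ \frac{\Sigma_i + \Sigma_j}{2} \right]^{-1} (\mu_i - \mu_j) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_i + \Sigma_j}{2} \right|}{\sqrt{|\Sigma_i| \, |\Sigma_j|}} \qquad (9)$$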


The computation involves inversion of the covariance matrix, which is still a costly
operation. This cost can be reduced by making the covariance matrix diagonal. If the
elements of the sample vector $x_{m,i}$ are uncorrelated, the corresponding
covariance matrix will be nearly diagonal. Because a DCT-based transformation is
applied to the feature vectors in the temporal direction, the output elements in
that direction are uncorrelated. Also, the base vectors are obtained from the
cepstrum-based feature extraction system, which performs a DCT-based transformation
before producing the feature vector (Figure 4.5). This implies that the correlation
between the feature elements has been drastically reduced. Hence, a diagonal
covariance matrix can be used instead of the full covariance matrix.
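
With a diagonal covariance, the matrix inverse and the determinants in the Bhattacharyya measure reduce to element-wise operations. The following is a minimal Python sketch, assuming Gaussian class distributions and the standard two-term form of the Bhattacharyya distance. The function names `class_stats` and `bhattacharyya_diag`, the 1/N variance normalization and the small `eps` guard are illustrative assumptions, not taken from the thesis implementation.

```python
import numpy as np

def class_stats(samples):
    """Mean and per-dimension variance of one class.

    samples: array of shape (N, D), one flattened feature vector per row.
    """
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    var = samples.var(axis=0)          # 1/N normalization (assumed)
    return mean, var

def bhattacharyya_diag(mean_i, var_i, mean_j, var_j, eps=1e-12):
    """Bhattacharyya distance between two Gaussians with diagonal covariance."""
    var_avg = 0.5 * (var_i + var_j) + eps
    diff = mean_i - mean_j
    # Mean term: the matrix inverse becomes an element-wise division.
    term_mean = 0.125 * np.sum(diff ** 2 / var_avg)
    # Covariance term: log-determinants become sums of log variances.
    term_cov = 0.5 * np.sum(
        np.log(var_avg) - 0.5 * (np.log(var_i + eps) + np.log(var_j + eps))
    )
    return float(term_mean + term_cov)

# Two toy classes in a 2-D space with unit variance and means 2 apart.
m_a, v_a = np.array([0.0, 0.0]), np.array([1.0, 1.0])
m_b, v_b = np.array([2.0, 0.0]), np.array([1.0, 1.0])
d_ab = bhattacharyya_diag(m_a, v_a, m_b, v_b)
d_aa = bhattacharyya_diag(m_a, v_a, m_a, v_a)
```

The cost per class pair is linear in the dimension D, instead of the cubic cost of inverting a full D x D covariance matrix.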

Figure 4.5: Decorrelation in measurement space

As our problem consists of more than two classes, we need to extend the two-class
measure to a multi-class measure. This has been done in the literature[4] using the
a priori probability of the classes considered. Let the a priori probability of
class $\Omega_i$ be $\Pr(\Omega_i)$. Then the multi-class distance measure can be
given as,

$$D_{ave} = \sum_{i=1}^{M} \sum_{j=1}^{M} \Pr(\Omega_i) \Pr(\Omega_j) D_B(i,j) \qquad (10)$$

where $D_B(i,j)$ is the Bhattacharyya measure between classes i and j.

In our case, we assume that all the classes are equally likely. Therefore Equation
10 can be simplified using $\Pr(\Omega_i) = \frac{1}{M}$. Hence the final equation is,

$$D_{ave} = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} D_B(i,j) \qquad (11)$$

The average distance obtained using the above equation is used as the fitness 
measure for the filter-bank provided. 
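
The averaging in Equations 10 and 11 can be written compactly once the pairwise distances are collected in a matrix. The sketch below is a hedged illustration: `average_distance` is a hypothetical helper name, and the input is assumed to be an M x M array of precomputed two-class distances.

```python
import numpy as np

def average_distance(dist, priors=None):
    """Prior-weighted average of a pairwise class distance matrix.

    dist:   (M, M) array, dist[i, j] is the two-class distance.
    priors: length-M class probabilities; defaults to equally likely classes.
    """
    dist = np.asarray(dist, dtype=float)
    m = dist.shape[0]
    if priors is None:
        priors = np.full(m, 1.0 / m)   # equally likely classes
    priors = np.asarray(priors, dtype=float)
    # sum_i sum_j P(i) P(j) D(i, j) as a quadratic form
    return float(priors @ dist @ priors)

# Toy example with two classes at distance 2 from each other.
toy = np.array([[0.0, 2.0],
                [2.0, 0.0]])
fitness_equal = average_distance(toy)
fitness_skewed = average_distance(toy, priors=[0.75, 0.25])
```

With equal priors the quadratic form reduces to the plain mean of all M x M entries, including the zero self-distances on the diagonal.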

Discussion 

In this chapter, a GA-based optimization framework has been described that uses
filter-bank based feature extraction as its search space. The feature spaces induced
by other feature extraction systems can be searched by changing the genetic
representation and redefining the operators. The objective function need not be
changed as long as the feature extractor falls into the frame-based category, which
is mostly the case.
Chapter 5


Experiments and Results 


5.1 Baseline Recognition System: SPHINX 

The evolved filter-bank was tested on a speech recognition task. This requires a
complete speech recognition system that is modular enough to facilitate the changes
needed to incorporate the evolved feature extractor. The two open source systems
considered were SPHINX[25] and HTK[23]. SPHINX was chosen for our experiments
because of its modularity and flexibility. A brief overview of the training and
decoding systems is provided in this section.

Sphinx is a set of HMM-based speech recognition engines and training programs
developed at Carnegie Mellon University (CMU). It includes the decoding engines
Sphinx-2, Sphinx-3 and Sphinx-4 and a set of training tools, SphinxTrain. Of these,
we used Sphinx-4 as the decoding system, since it is specifically designed to
provide a modular platform for speech recognition experiments. SphinxTrain is used
to train the HMM models.

The overall architecture of the system is depicted in Figure 5.1. Each labeled
element is a configurable module that can be easily replaced, allowing researchers
to run different experiments with different modules without changing other parts.
Sphinx-4 provides both simple and state-of-the-art implementations of each module.
As with other speech recognition systems, Sphinx-4 has a large number of parameters
pertaining to each module. There must also be some way to specify which
implementation is to be used for each module. A Configuration Manager performs this
task. It uses a global configuration file containing the full specification of the
system, so a different implementation of a front-end can be placed into the system
without changing any other code or even recompiling. The framework is defined using
Java interfaces.

Figure 5.1: Sphinx-4 Decoder Framework

In our implementation, the front-end module is modified to experiment with evolved
filter-banks. The default implementation is used for the other modules. A detailed
block diagram of the front-end is shown in Figure 5.2. It comprises one or more
parallel chains of replaceable communicating modules called DataProcessors. This
enables the system to use different kinds of feature extractors in different
combinations. The frame-based feature extractor discussed in Chapter 3 is
implemented in Sphinx as a chain of DataProcessors. By default, the filter-bank
processing block uses a Mel scale-based filter-bank. This block was therefore
modified to process the input spectrum using any specified filter-bank, which can
now be given in the configuration file.

Figure 5.2: Sphinx-4 Front-end
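
What "processing the input spectrum using any specified filter-bank" amounts to can be sketched independently of Sphinx-4. The Python below is an illustration only: it assumes each triangular filter is described by a (left, center, right) triple of FFT bin indices, and the names `triangular_filter` and `apply_filter_bank` are hypothetical, not Sphinx-4 classes or configuration properties.

```python
import numpy as np

def triangular_filter(left, center, right, n_bins):
    """Weights of one triangular filter over n_bins spectrum bins.

    Requires left < center < right < n_bins. The weight rises linearly
    from 0 at `left` to 1 at `center` and falls back to 0 at `right`.
    """
    w = np.zeros(n_bins)
    rising = np.arange(left, center + 1)
    falling = np.arange(center, right + 1)
    w[rising] = (rising - left) / (center - left)
    w[falling] = (right - falling) / (right - center)
    return w

def apply_filter_bank(power_spectrum, edges):
    """Filter-bank energies for a list of (left, center, right) bin triples."""
    power_spectrum = np.asarray(power_spectrum, dtype=float)
    n_bins = len(power_spectrum)
    bank = np.stack([triangular_filter(l, c, r, n_bins) for l, c, r in edges])
    return bank @ power_spectrum

# A 512-point FFT gives 257 non-negative frequency bins; flat toy spectrum.
spectrum = np.ones(257)
energies = apply_filter_bank(spectrum, [(2, 5, 8), (5, 8, 12)])
```

Because the bank is just a matrix of weights, replacing the Mel filter-bank by an evolved one only changes the list of edge triples, not the processing code.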

5.2 Data Sets 

As the method is data driven, the availability of a specific type of data is very
important. In our experiments, we have used three data sets.

1. Prologix Data Set 

This data set contains phonetically tagged Hindi speech data from a single
female speaker. Phonetically tagged speech data contains a mapping between the
speech signal and its transcription at the phoneme level. This is the main data
set used in our learning experiments, as they are performed on phonemes.

Speech data can be phonetically tagged using the forced alignment[23] method
followed by a manual check of the tagging. First, HMM models are trained at the
phoneme level on some of the speech data. These models are then used with an
untagged speech signal and its corresponding phonemic transcription to find the
best match. This process is called forced alignment. The tagging generated by
this process has some misalignments that have to be corrected manually. This
data set was obtained from Prologix Software Pvt. Ltd.



2. MLA Data Set

This data set contains 44 Hindi words used in a general Pathology Lab task.
Each word is spoken by 12 speakers. The data was recorded in an unrestricted
environment with different kinds of noise.

3. Hindi Data Set 

This is also an isolated-word data set, containing 5500 Hindi words spoken
by 7 male and 3 female speakers. The data was recorded in a clean
environment.

5.3 Hindi Phoneme Recognition Task 

Phonemes are the speech units used as base models in speech recognition systems.
The acoustic models for higher-level speech units are generally built by combining
the acoustic models at the phonemic level. As discussed in Chapter 2, the acoustic
models for phonemes are generally built using HMMs. Hence our aim is to search for
features that enable better recognition at the phoneme level, which would in turn
increase the accuracy at higher levels.

Here the recognition task is to classify the Hindi phoneme set. There are 48
phonemes, including one silence phone, making M = 48. Each class has 40 samples
extracted from the tagged continuous speech data, uttered in different contexts. To
modify a filter-bank randomly within a small neighborhood in the parameter space,
the variable md (maximum deviation) is set to 7. That is, each edge of a filter
(α, β, γ) can be randomly shifted by between -7 and +7 frequency bins. The
filter-bank consists of 40 triangular filters spaced between 133 Hz and 6800 Hz.
The values of the other parameters used in this experiment are shown in Table 5.1.
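
The bounded random displacement controlled by md can be sketched as follows. This is an assumption-laden illustration: the function name `mutate_filter`, the choice to perturb all three edges at once, and the repair step that keeps the edges ordered and inside the spectrum are not taken from the thesis.

```python
import numpy as np

def mutate_filter(edges, md, n_bins, rng):
    """Displace each (left, center, right) bin edge by at most md bins."""
    left, center, right = (
        int(e) + int(rng.integers(-md, md + 1)) for e in edges
    )
    # Repair step (assumed): stay inside the spectrum, keep left < center < right.
    left = min(max(left, 0), n_bins - 3)
    center = min(max(center, left + 1), n_bins - 2)
    right = min(max(right, center + 1), n_bins - 1)
    return left, center, right

rng = np.random.default_rng(0)
original = (20, 30, 40)                # hypothetical filter, in FFT bins
mutated = [mutate_filter(original, md=7, n_bins=257, rng=rng) for _ in range(200)]
```

For an input filter that is already valid, the repair step never moves an edge further than md bins from its original position, so the mutation stays within the intended neighborhood.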

The result of the optimization and the corresponding accuracy are shown in Figure
5.3. The recognition accuracy is obtained using the MLA data set, with 10 speakers
used for training and 2 speakers used for testing. The dash-dotted line shows the
fitness of the best individual in every population, and the solid line shows the
accuracy obtained on the MLA data using the filter-bank of the corresponding
individual. The dotted line shows the accuracy obtained using the Mel filter-bank
(say FB_mel) instead. The recognition accuracy oscillates with respect to fitness,
though for the initial half of the generations it remains higher than that of
FB_mel. Some of the filter-banks perform 5-7% better than the Mel scale.

Figure 5.3: All Phoneme Experiment Results (fitness and accuracy plot for the
AllPhoneme experiment)

Parameter Name            Value
Number of FFT points      512
Population Size           100
Mutation rate             0.1
Crossover rate            0.9
Number of Generations     200

Table 5.1: Parameter values for AllPhoneme Experiment
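
To show where the values of Table 5.1 enter, the following is a generic GA loop in Python. It is a sketch, not the thesis implementation: binary tournament selection, one-point crossover, elitism and the toy fitness function are all assumptions introduced for illustration, and every name in the block is hypothetical.

```python
import numpy as np

def run_ga(init, fitness, mutate, pop_size=100, p_mut=0.1, p_cross=0.9,
           generations=200, seed=0):
    """Generic GA loop; returns the best individual and best fitness per generation.

    `mutate(ind, rng)` must return a new array and leave `ind` unchanged.
    """
    rng = np.random.default_rng(seed)
    pop = [mutate(init, rng) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = []
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        if scores.max() >= fitness(best):
            best = pop[int(scores.argmax())]
        history.append(fitness(best))
        children = [best.copy()]                   # elitism (assumed)
        while len(children) < pop_size:
            # Binary tournament selection (assumed) for each parent.
            a, b = rng.integers(pop_size, size=2)
            p1 = pop[a] if scores[a] >= scores[b] else pop[b]
            a, b = rng.integers(pop_size, size=2)
            p2 = pop[a] if scores[a] >= scores[b] else pop[b]
            child = p1.copy()
            if rng.random() < p_cross:             # one-point crossover (assumed)
                cut = int(rng.integers(1, len(child)))
                child[cut:] = p2[cut:]
            if rng.random() < p_mut:
                child = mutate(child, rng)
            children.append(child)
        pop = children
    return best, history

# Toy problem standing in for the filter-bank search: match a target vector.
target = np.array([3, -2, 5, 0, 1, -4, 2, 6])

def toy_fitness(ind):
    return -float(np.sum((ind - target) ** 2))

def toy_mutate(ind, rng, md=7):
    out = ind.copy()
    pos = int(rng.integers(len(out)))
    out[pos] += int(rng.integers(-md, md + 1))     # bounded random shift
    return out

best, history = run_ga(np.zeros(8, dtype=int), toy_fitness, toy_mutate)
```

Because the best individual is always copied into the next generation, the best fitness in `history` never decreases. This says nothing about recognition accuracy, which is measured separately on held-out data.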

The envelope of the accuracy graph seems to decrease as the number of generations
increases. This is interesting behavior, which could potentially lead to a new
method for speaker adaptation. As the GA training is performed using speech data
from a single speaker, the filter-banks move in a direction in the parameter space
that is specific to that speaker. More experiments in this direction could provide
a method that derives a filter-bank specific to a speaker. However, this behavior
is not desirable when developing a filter-bank for speaker-independent recognition.
Phonetically tagged data from a range of speakers can be used instead to alleviate
this problem. In that experiment, the optimization is expected to learn a
filter-bank pattern that achieves high fitness across the whole range of speakers.

The filter-banks achieving high accuracy can be used in speech recognition systems
if the improvement is sustained across different data sets. To check this property,
experiments were carried out on the Hindi data set. We compared the baseline FB_mel
with the filter-bank marked in Figure 5.3, say FB_allphoneme. This filter-bank is
generated at the sixth generation of evolution. Training was done on 5 speakers and
testing on 2 speakers. Table 5.2 compares the performance of both filter-banks on
the MLA and Hindi data sets. We notice that the accuracy obtained using
FB_allphoneme is higher than that obtained using FB_mel.


Data Set    Number of Words    Recognition Accuracy
                               FB_mel      FB_allphoneme
MLA         44                 85.22%      93.75%
Hindi       20                 75%         82.25%
            40                 81.25%      83.75%
            80                 56.87%      65.62%

Table 5.2: Comparison of FB_mel and FB_allphoneme on MLA and Hindi data sets


Figure 5.4 shows both filter-banks. Most of the filters in the two banks are
similar, except that some filters have their left or right frequency edge
displaced. Figure 5.5 shows the center frequencies of both filter-banks.

Figure 5.4: FB_allphoneme (top) and FB_mel (bottom)

Figure 5.5: Center Frequencies of the Best Filter Bank

In another experiment, only the Hindi vowels, together with a silence phone, were
used as classes. All other parameters were kept the same. Here, too, we obtained
similar behavior in the accuracy plot.
Chapter 6


Conclusion and Future Work 


6.1 Conclusion 

In this thesis, we have attempted to obtain a feature extractor that retains the
discriminative information by maximizing a class separability criterion. To the
best of our knowledge, this is the first time genetic algorithms have been used to
optimize feature extraction for speech recognition. The results of the experiments
indicate that the GA is able to search the space, obtaining better filter-banks in
every generation with respect to a given objective function. This indicates that
the operators (mutation, crossover and initialization) have been defined properly,
allowing the algorithm to search the space efficiently. Genetic algorithms can be
used in such optimization without placing any constraints on the objective
function. Hence more complex objective functions can be used.

From the experiments performed on Hindi vowels and on all Hindi phonemes, some
filter-banks have been obtained that perform significantly better in terms of
accuracy than the baseline Mel systems. The performance improvement has been
observed on two data sets, the MLA data set and the Hindi data set. This shows that
there exist filter-banks that can give better results. Analysis of these
filter-banks, in particular the spread of the filters and their bandwidths, can
provide interesting information about the critical energy bands required. These
evolved filter-banks can be used in Hindi speech recognition systems to obtain this
performance improvement.

6.2 Future Work 

The current work indicates that more exploration in filter-bank based feature
extraction can lead to better feature extractors. The effect of noise on different
kinds of feature extractors has been well studied in the literature. Generally,
that work has focused on changing the filter-bank bandwidths[20] to obtain noise
robustness. If the GA-based approach is applied to noisy data, the resulting
filter-bank is expected to be robust to that kind of noise. Another possible
experiment is to initialize the filter-banks with different strategies, such as
random initialization, nearly linear initialization, or a mixture of variations of
Mel and Bark scale filter-banks. The path that the filter-banks traverse and its
convergence can give interesting information about the filter-banks being used.

The aspect of memory requirement can also be studied, that is, how a learned
filter-bank performs with a reduced number of Gaussians and states per speech unit.
The objective function used in our work gives oscillatory behavior in the
recognition accuracy. Further exploration of the properties of the objective
function and its behavior may lead to monotonic improvement in the recognition
accuracy.

Optimization in other search spaces 

The optimization framework is able to search for the best feature extraction
parameters in the search space defined by a specific feature extraction method. In
our work, we have used Fourier transform based, frame-based feature extraction.
Another potential method that can be parameterized is based on the wavelet
transform. A wavelet transform is used for multi-resolution analysis of a signal
and does not require the signal to be stationary. Therefore this transform can
parameterize the speech signal without having to assume that it is stationary.
Speech also contains phonetic information at different resolutions. Hence a
wavelet-based parameterization is expected to provide better features. An
improvement in accuracy has already been reported in [11].